ABSTRACT

Accurate and timely diagnosis of coronavirus patients, together with preventative measures, is
likely to boost patient survival and decrease fatality rates. An online questionnaire was built
explicitly for this study as the data collection approach, gathering 250 samples from people affected
and not affected by COVID-19 symptoms in Karnataka between July 2021 and September 2021. Experiments
were conducted using thirteen clinical characteristics identified as predictive of survival versus
death in COVID-19 participants. Several machine learning approaches, namely Random Forest, Support
Vector Machine, Decision Tree, and Multi-Layer Perceptron, were used to develop predictions from the
acquired data. Based on the participants' signs and symptoms, these models were used to identify
likely COVID-19 patients. SVM achieved an accuracy of 91.86%, while Random Forest is the preferred
choice because of its precision (97.87%), the highest among the models. This research establishes an
approach for the early detection of COVID-19 patient outcomes using patient characteristics monitored
at home during quarantine. The study examined all risk variables and vital signs that can be measured
at home, so that the model can serve as an early warning system.
INTRODUCTION

As of October 31st, 2021, the WHO's worldwide COVID-19 pandemic report listed 247,621,724 confirmed
cases and 5,017,885 fatalities. Of these, the WHO reports 34,285,814 confirmed cases in India.
According to the most recent available statistics, Karnataka has documented 2,988,333 confirmed cases
and 38,082 deaths [1]. The overwhelming majority of patients, with mild to severe symptoms, tolerated
COVID-19 well and recovered spontaneously [2, 3]. Infected individuals commonly develop cough, fever,
and pneumonia [4], and these symptoms can be used for disease screening.

India emphasized the critical nature of continued surveillance to identify potential COVID-19
patients at an early stage of sickness. In March 2020, India implemented extraordinary nationwide
preventative measures to combat disease transmission. These measures included a large-scale
quarantine, widespread travel restrictions, and social isolation, as well as continued surveillance
of suspected COVID-19 cases and area blockades to contain the virus. The World Health Organization
reported confirmed COVID-19 cases in India from March 20th, 2020, to October 31st, 2021. However, the
influence of these measures on disease containment and the subsequent course of events remains
unknown. Identifying new COVID-19 cases based on shared symptoms is critical for tracking the global
epidemic's progression, since it will aid in disease containment.

The rapid spread of the pandemic is a global concern and a major threat to public health and the
global economy. To prevent the spread of infection, the majority of countries used preventative
measures such as isolation and quarantine. Many deceased patients, however, were unable to benefit
from adequate treatment because the virus was detected late and its origin was unique and unknown.
Numerous researchers have recently focused on developing strategies for screening infected persons at
various stages in order to uncover significant correlations between the clinical parameters of the
patient and the chance of succumbing to the disease [5, 6]. Recent research indicates that approaches
using artificial intelligence (AI) and machine learning (ML) may be crucial for minimizing the
impacts of virus spread [7-9]. Machine learning methods are being applied to patient data along a
range of research paths [10]. Two essential research directions are predicting infection and
mortality rates and establishing a model to identify patients based on their clinical findings
[11, 12]. These investigations are vital because they significantly assist those working in the
health sector in being prepared and taking all necessary precautions to contain the spread of a
pandemic.

Yadav et al. [13] examined COVID-19 projections in numerous nations, including the United States. The
researchers compared patients who were already infected with the virus to those who were still living
and had recovered. A vast amount of information on the internet is given in mathematical and
graphical format. Statistical analysis and machine learning are critical for improving results and
making informed decisions in various fields, including medicine, business, economics, web search
engines, social media platforms such as Facebook, and commerce.

Sumayh et al. [15] proposed an approach for forecasting the outcome of COVID-19 patients using data
obtained at home and in quarantine. The study analyzed 287 COVID-19 samples obtained from patients at
the King Fahad University Hospital in Saudi Arabia. The data were classified using logistic
regression, random forest (RF), and extreme gradient boosting (XGB). With an accuracy of 0.95 and an
AUC of 0.99, RF outperformed the other classifiers.

Pourhomayoun and Shakibi [16] proposed an AI model to assist hospitals and medical institutions in
prioritizing patients, triaging patients when the system is overcrowded, and reducing wait times for
critical care. Their study analyzed a dataset containing 307,382 labeled samples from approximately
2,670,000 COVID-19 patients from 146 countries. To predict death in COVID-19 patients, they used SVM,
Artificial Neural Networks, Random Forest, Decision Tree, Logistic Regression, and K-Nearest Neighbor
(KNN). Finally, they evaluated the accuracy of their model using a second COVID-19 patient dataset
and the sensitivity and specificity of their classifiers using a confusion matrix.

Infection risk prediction algorithms that integrate many features have been created. In [17], the
authors developed a machine learning algorithm using test data from 51,831 individuals (4,769 of whom
were confirmed to have COVID-19). Their model accurately predicted COVID-19 test results using only
eight binary features: sex, age of 60 or above, known contact with an infected person, and five early
clinical signs. Overall, they created a model that detects COVID-19 cases from simple features
obtained by asking basic questions [17].

In [18], multivariate predictive analysis was performed on a hospitalized cohort of COVID-19 patients
(N = 250) with 108 clinical features, comorbidities, and blood indicators. By combining training and
validation data, a SIMPLS-based model predicted hospital mortality in COVID-19 patients with moderate
predictive power (Q2 = 0.24) and a high degree of accuracy (AUC > 0.85). Coronary artery disease,
diabetes, Alzheimer's disease, advanced age, and dementia were the top predictors of mortality. A
reliable COVID-19 death prediction model based on clinical features and comorbidities may aid in the
treatment of COVID-19 patients. That study demonstrated the application of machine learning to
predict hospital mortality in COVID-19 patients, to identify critical clinical, comorbidity, and
blood biochemical predictors, and to classify COVID-19 survivors into high- and low-risk groups [18].

Data science and machine learning can aid in the fight against this pandemic. In [19], the
probability of infection and the number of positive cases were predicted using machine learning
methods. Two classifiers were chosen for their high accuracy: random forest and extra trees. The
extra trees classifier reached an accuracy of 93.62%. The availability of tools for anticipating
infectious diseases may aid in the fight against COVID-19 [19].

Marzouk et al. [20] used AI to forecast the COVID-19 outbreak in Egypt. Three types of model were
considered: LSTM, convolutional neural network, and multilayer perceptron neural network. Finally,
the LSTM model with optimal parameters was used to forecast the spread of the pandemic for one month
using data from February 14, 2020, to June 30, 2021. The forecast for July 31, 2021, was 285,939
infections, 234,747 recoveries, and 17,251 deaths.

The Piacenza score was developed to predict 30-day mortality in COVID-19 pneumonia patients [21].
From February to November 2020, 852 people with COVID-19 pneumonia were hospitalized at Italy's
Guglielmo da Saliceto Hospital. The naive Bayes classifier computed the score using data from 86
patients hospitalized at Centro Cardiologico Monzino (Italy) in February 2020. Additionally, the
authors compared this score to the 4C score and to a naive Bayes algorithm with 14 predefined
features. The Piacenza score had an AUC of 0.78 (95% confidence interval [CI] 0.74-0.84, Brier
score = 0.19) in the internal validation cohort and 0.79 (95% CI 0.68-0.89, Brier score = 0.16) in
the external validation cohort, which was comparable to the 4C score and the naive Bayes model with a
priori selected features, which had AUCs of 0.78 (95% CI 0.73-0.83, Brier score = 0.26).

Muhammad et al. [22] leveraged computational technologies such as AI and ML to identify biomarkers
for COVID-19 identification, early detection, and prognosis. They analyzed and developed AI-based
prediction models for COVID-19 patient survival and death using publicly available datasets of
clinical factors and protein profiles. The best clinical parameter classification model predicted
COVID-19 patient survival or death with an accuracy of 89.47%, a sensitivity of 85.71%, and a
specificity of 92.45%. The classification model constructed using normalized protein expression
values for 45 proteins predicted survival or death with an accuracy of 89.01%, a sensitivity of
92.68%, and a specificity of 86%. Their findings indicate the molecular relevance of a few clinical
features and proteins in the course of COVID-19.

Annwesha et al. [23] created supervised machine learning models for COVID-19 infection using an
epidemiological labeled dataset of Mexico's positive and negative COVID-19 cases. When the models
were examined, the decision tree model achieved the highest accuracy (94.99%), while the SVM and
Naive Bayes models were reported at 93.34% and 94.30%, respectively.

The goal of this research is to develop a prediction model for estimating the degree of sickness in
COVID-19 patients using risk indicators that can be monitored remotely while the patient is at home.
Additionally, the study evaluates the effect of vital signs, chronic diseases, early clinical
investigations, and demographic variables on the survival versus mortality of COVID-19 patients. The
study used data from COVID-19 affected and non-affected participants to validate the model's
performance and effectiveness. The data comprised clinical findings and demographic information. The
study examined all risk variables and vital signs that can be measured at home, so that the model can
serve as an early warning system, alerting physicians in real time to patients who may be in danger.

The remainder of this paper is structured as follows: Section 2 describes the research methodology,
Section 3 presents the results and discussion, and Section 4 concludes with a summary of the findings
and recommendations for further investigation.


MATERIAL AND METHODS 

The researchers used a quantitative cross-sectional design to develop an accurate model for
predicting COVID-19 diagnosis based on participant-reported symptoms and indicators. Indian states
are geographically divided into north, east, west, and south. India's southern region comprises five
states: Karnataka, Tamil Nadu, Kerala, Andhra Pradesh, and Telangana. This article examines only the
state of Karnataka. According to the Indian Department of Statistics, Karnataka is home to 5.41% of
India's total population.

The survey was conducted online and was open to all eligible respondents from any location within
Karnataka state. New COVID-19 indicators and symptoms were used in this study. Our method consists of
four stages, in order: data collection, preprocessing, classification, and performance evaluation.
Machine learning models were employed for the classification stage: Support Vector Machine (SVM),
Multi-Layer Perceptron (MLP), Random Forest (RF), and Decision Tree (DT) classifiers. The performance
of each classifier was then tested and documented.

Data Collection 

Before data collection at Krupanidhi Group of Institutions began, ethical approval was obtained from
the Institutional Review Board (KGI-KRIC:2021/09/01). Additionally, consent was requested from all
eligible participants through an online consent form before the survey began. The consent form
informed participants of the study's aims, significance, benefits, and risks. It stated explicitly
that participation was entirely voluntary, that there were no dangers associated with it, and that
participants could withdraw at any time without any consequence. Participation in or exclusion from
the study would not affect their overall treatment strategy. Participants were also made aware of the
study's strict adherence to privacy and confidentiality procedures: the data would be used only for
the study's objectives, and no one else would have access to it. The questionnaire uses code numbers
rather than names, which supports confidentiality.

The data collection process began in July 2021 with the publication of an online survey
questionnaire. Individuals who expressed an interest in participating were asked whether they met the
eligibility criteria: (a) being at least 18 years old, (b) being able to read and speak, and (c)
having completed a Polymerase Chain Reaction test within the preceding two weeks or more. Those who
signed the consent form stated that they did not have any serious diseases at the time of data
collection, which their physician validated. The online survey included the primary researcher's
contact information in case any questions arose that needed to be addressed more extensively. The
survey takes an average of three minutes to complete.
Data Pre-processing

The questionnaire was intended for anyone who might be a candidate for novel COVID-19. Of the 250
persons sampled, 30 were excluded owing to missing or inconsistent data (e.g., Age: adult). The
accepted sample of 220 has a high level of accuracy, at 86.36%. Based on their current symptoms and
signs, a machine learning system predicted potential COVID-19 patients from the questionnaire
responses. Our novel COVID-19 data set poses an imbalanced classification problem: 153 of the 220
subjects (69.55%) had a negative result (Negative PCR Test), while 67 of the 220 (30.45%) had a
positive result (Positive PCR Test). Table 1 summarizes the collected samples, whereas Table 2
describes the symptoms of participants affected with Covid-19 (N=67) and not affected with Covid-19
(N=153).
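
The class proportions follow directly from the counts, and they also show why accuracy alone can
mislead on this data set. The short Python sketch below recomputes the proportions from the counts
and derives the accuracy of a trivial baseline that always predicts a negative PCR result. The
variable names and the baseline are illustrative assumptions, not part of the study's code.

```python
# Counts taken from the text: 220 retained participants, 67 of them PCR-positive.
n_total = 220
n_positive = 67
n_negative = n_total - n_positive  # PCR-negative participants

pct_negative = round(100 * n_negative / n_total, 2)
pct_positive = round(100 * n_positive / n_total, 2)

# A hypothetical baseline that labels everyone "negative" is right for every
# negative participant and wrong for every positive one.
baseline_accuracy = pct_negative
baseline_sensitivity = 0.0  # it never detects a positive case

print(f"negative: {pct_negative}%  positive: {pct_positive}%")
print(f"always-negative baseline: accuracy {baseline_accuracy}%, "
      f"sensitivity {baseline_sensitivity}%")
```

A useful classifier therefore has to exceed this baseline accuracy while still detecting positive
cases, which is one reason to report sensitivity and specificity alongside accuracy.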

The dataset contains thirteen variables (Table 1): twelve test features and the PCR result as the
class attribute. The characteristics recorded for a probable novel COVID-19 patient include age,
smoker status, positive chest x-ray, fever, sore throat, aches and pains, dry cough, nasal
congestion, loss of smell, diarrhea or vomiting, and difficulty breathing. All attributes except age
are binary categorical types. Age is an integer that has been normalized to match the rest of the
attributes.
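
The text does not state which normalization was applied to age. As one plausible reading, the sketch
below uses min-max scaling, which maps ages onto the same 0 to 1 range as the binary attributes. The
function name `min_max_scale` and the sample ages are hypothetical.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: avoid division by zero and map them to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical ages, not taken from the study data.
ages = [18, 30, 43, 68]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

Scaling age this way keeps it from dominating distance-based or gradient-based models such as SVM
and MLP, whose other inputs are all 0 or 1.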

Data Classification 

Classification, in conjunction with categorization, enables the recognition and interpretation of
both positive and negative findings. This is performed using a variety of statistical models and
other machine learning techniques. Because of its high prediction accuracy and power, machine
learning is widely used in various industries. In contrast, statistical analysis focuses on models
with varying degrees of uncertainty that are easily comprehended [3]. The following models were used
in this investigation: Support Vector Machine, Multi-Layer Perceptron, Random Forest, and Decision
Tree.

Evaluation Performance

When comparing the results of several models, selecting the appropriate assessment metric is crucial.
Typically, statisticians compare the predicted label to the actual class label. Predictive models are
evaluated using a variety of criteria, including precision, specificity, and accuracy. Accuracy is a
critical measure to consider while building predictive models. Given the binary, imbalanced nature of
our classification problem, the following metrics were used.

In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives,
false positives, and false negatives, respectively.

Accuracy: The percentage of correctly classified test set tuples.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Sensitivity: The ability to detect illness.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

Specificity: The ability to reject a healthy patient.

\[ \text{Specificity} = \frac{TN}{TN + FP} \]

Geometric Mean: A diagnostic novel COVID-19 test must be both sensitive and specific. Thus, the
geometric mean (Gmean) of these two metrics is as follows:

\[ G_{mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \]

Precision: The proportion of positive class predictions that are correct.

\[ \text{Precision} = \frac{TP}{TP + FP} \]
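
The five metrics can be computed from the four confusion-matrix counts. The following is a minimal
sketch using the standard definitions of these metrics; the function name `classification_metrics`
and the example counts are hypothetical and are not results from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return the five evaluation metrics as fractions between 0 and 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gmean": math.sqrt(sensitivity * specificity),
        "precision": tp / (tp + fp),
    }


# Hypothetical confusion-matrix counts for 67 positive and 153 negative cases;
# these are NOT results from the study.
metrics = classification_metrics(tp=60, fp=13, tn=140, fn=7)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Because the geometric mean is built from sensitivity and specificity only, it always lies between
those two values, which makes it a convenient single summary for imbalanced data.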


RESULTS AND DISCUSSION 

The researchers constructed the models for this study using the Python programming language and its
extensive standard library [2°]. The tolerance level for each model was set at 0.001, and tenfold
cross-validation was used. While developing a model, it is critical to include all test
characteristics. The suggested model was 95% accurate with a cutoff of 0.5 and a confidence level of
95%. Finally, the Multi-Layer Perceptron model follows a shallow-deep hierarchy: it uses the sigmoid
activation function (with outputs between 0 and 1) and a 12-neuron hidden layer.
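
The text does not name the library used or say how the ten folds were formed. As a rough
illustration of tenfold cross-validation on an imbalanced sample, the sketch below builds the folds
with the Python standard library only and, as an assumption, stratifies them so that each fold keeps
roughly the same share of positive cases. The function name `stratified_kfold_indices` is
hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve the class proportions.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits


# Labels mirroring the class sizes in the text: 67 positive, 153 negative.
labels = [1] * 67 + [0] * 153
for fold_no, (train_idx, test_idx) in enumerate(stratified_kfold_indices(labels), 1):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)} "
          f"positives in test={positives}")
```

Each model would be fitted on the training indices of a fold and scored on its test indices, and the
reported metrics would then be averaged over the ten folds.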
Table 1. Collected Samples Descriptive Statistics

Collected Sample        Mean     Variance
Age                     43.41    118.36
Gender                   1.57      0.34
Smoker                   0.68      0.36
Fever                    0.45      0.28
Dry Cough                0.79      0.42
Aches and Pain           0.96      0.51
Sore Throat              1.24      0.39
Absence of smell         0.37      0.26
Nasal Congestion         0.31      0.22
Diarrhea or Vomiting     0.54      0.32
Breathing                0.49      0.36
Positive X-ray           0.06      0.05
Positive PCR             0.58      0.47

Table 1 shows the descriptive statistics of the collected sample, covering vital signs and
demographic variables, for the 220 participants.


Table 2. Affected with or without Covid-19 Symptoms

Symptom                 Affected with      Affected with    Not Affected with    Not Affected with
                        Covid-19 (N=67)    Covid-19 (%)     Covid-19 (N=153)     Covid-19 (%)
Fever                   48                  71.64           106                  69.28
Dry Cough               48                  71.64            97                  63.40
Aches and Pain          67                 100.00           122                  79.74
Sore Throat             52                  77.61           114                  74.51
Absence of smell        21                  31.34             2                   1.31
Nasal Congestion        18                  26.87            28                  18.30
Diarrhea or Vomiting    36                  53.73            32                  20.92
Breathing               39                  58.21            19                  12.42
Positive X-ray          11                  16.42             6                   3.92

Table 2 shows how the symptoms and vital signs were distributed among participants affected and not
affected with Covid-19.
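
The percentages in Table 2 can be recomputed from the counts and the two group sizes. The sketch
below does this and also ranks the symptoms by the gap in prevalence between the two groups. The gap
column is an illustrative addition and not a statistic reported in the study; symptom labels follow
Tables 1 and 2.

```python
# (symptom, count among 67 affected, count among 153 not affected), from Table 2.
N_AFFECTED, N_NOT_AFFECTED = 67, 153
counts = [
    ("Fever", 48, 106),
    ("Dry Cough", 48, 97),
    ("Aches and Pain", 67, 122),
    ("Sore Throat", 52, 114),
    ("Absence of smell", 21, 2),
    ("Nasal Congestion", 18, 28),
    ("Diarrhea or Vomiting", 36, 32),
    ("Breathing", 39, 19),
    ("Positive X-ray", 11, 6),
]

rows = []
for symptom, affected, not_affected in counts:
    pct_affected = round(100 * affected / N_AFFECTED, 2)
    pct_not_affected = round(100 * not_affected / N_NOT_AFFECTED, 2)
    gap = round(pct_affected - pct_not_affected, 2)
    rows.append((symptom, pct_affected, pct_not_affected, gap))

# Sort by the gap in prevalence so the most distinguishing symptoms come first.
rows.sort(key=lambda row: row[3], reverse=True)
for symptom, pct_a, pct_n, gap in rows:
    print(f"{symptom:<22}{pct_a:>8.2f}{pct_n:>8.2f}{gap:>8.2f}")
```

On these counts, difficulty breathing shows the largest gap between the two groups, while fever and
sore throat are almost equally common in both.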


Table 3. Different Classification Machine Learning Technique Evaluation Performance

Evaluation Performance    Support Vector    Multi-Layer    Random Forest    Decision Tree
                          Machine           Perceptron
Accuracy (%)              91.86             92.11          91.83            90.48
Sensitivity (%)           93.68             91.79          92.68            92.64
Specificity (%)           89.51             88.91          94.26            90.32
Geometric Mean (%)        90.74             95.67          96.24            94.71
Precision (%)             94.32             92.45          97.87            95.11

Table 3 contains the performance evaluations of the machine learning models Support Vector Machine,
Multi-Layer Perceptron, Random Forest, and Decision Tree. In terms of accuracy, the SVM reaches
91.86%, close to the MLP (92.11%) and Random Forest (91.83%) and ahead of the Decision Tree (90.48%).
An inference model is used to examine how strongly the test attributes are linked to the class
attribute (outcome variable); as a result, compared with machine learning models, the prediction
accuracy of an inference model is not quite as good. The MLP provides the highest accuracy of the
approaches compared here, since it can capture exceedingly complex properties in its hidden layers
and does so via nonlinear functions. Machine learning models commonly sacrifice interpretability
with the goal of increased prediction accuracy.

A model's sensitivity and specificity can be assessed using a variety of measures. Because our data
set is centered on the new diagnostic COVID-19 test, the test's sensitivity and specificity are
critical [20]. Because of the tension between these two criteria, researchers have often focused on
either sensitivity or specificity. While this is a typical practice, we believe that the geometric
mean (Gmean) is a fairer measure because it considers both elements simultaneously. As Table 3
indicates, the Random Forest model achieved the highest geometric mean score of 96.24. In contrast,
the SVM achieved the lowest geometric mean score of 90.74, which was still an acceptable result.

In this context, precision describes how closely the model's positive predictions agree with the
observed values. The Random Forest model is the most precise model, with a precision of 97.87%, in
part because of the strong correlation structure that it can detect in our data set, as shown in
Table 3. On the other hand, the MLP model was the least precise, with a precision of 92.45%. Lower
precision reduces the prediction's utility, and errors become rather costly. Based on these findings,
the Decision Tree and SVM models may help create innovative COVID-19 detection methods based on
infection symptoms. MLP has a higher relative error and the lowest precision for predicting future
COVID-19 patients. Machine learning is well suited to predicting new COVID-19 infections from illness
symptoms and indicators, given the limited availability of Polymerase Chain Reaction testing in
India.


CONCLUSIONS 

The COVID-19 pandemic has caused devastation around the world and created a global health crisis.
Numerous attempts have been made to understand this pandemic. We investigated the effect of vital
signs, preliminary clinical data, and demographic variables on whether participants were affected by
COVID-19, using different machine learning methods. The results suggest that the Random Forest model
is the most precise, detecting COVID-19 patients with a precision of 97.87%. Despite the strong
results obtained by these methods, they need additional development, and many more datasets must be
employed to validate the models. Future work could take one of two directions. One is to use these
models in Indian hospitals to identify COVID-19 patients promptly and safely, thereby reducing the
virus's rapid spread throughout the country. The other is to evaluate the models' accuracy on a
larger dataset and tune their parameters.